In this project my objective was to find patterns in past or current loans that might be correlated with successful completion or default of a current loan. To this end I divided the loans into two groups, the “Completed”/“Defaulted” and the “Current”, and I investigated these two groups separately. I used univariate analysis to get a better sense of the dataset and multivariate analysis, along with statistical tests when necessary, to uncover possible correlations. Finally, I focused most of my analysis on those individuals that I considered more prone to default.
I investigated the Loan Data from Prosper. As the description on the project page states, this data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information.
The main variables which I explored were:
1. LoanStatus
2. EmploymentStatus
3. CurrentDelinquencies
4. DelinquenciesLast7Years
5. AmountDelinquent
7. AvailableBankcardCredit
8. DebtToIncomeRatio
9. IncomeRange
10. StatedMonthlyIncome
11. LoanCurrentDaysDelinquent
12. LoanOriginalAmount
13. MonthlyLoanPayment
The total population is 113,937 individuals:
n <- length(loans$ListingKey)
n
## [1] 113937
First I wanted to investigate the LoanStatus variable. To keep the bar plot clear, I show only the four most frequent values:
1. Chargedoff
2. Completed
3. Current
4. Defaulted
We see that current and completed loans make up the majority of the data.
## Cancelled Chargedoff Completed
## 5 11992 38074
## Current Defaulted FinalPaymentInProgress
## 56576 5018 205
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 16 806 265
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 363 313 304
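The selection of the four most frequent statuses can be sketched as follows. This is only a minimal illustration: `loans_demo` is a hypothetical toy data frame standing in for the Prosper data, and its counts are made up so that the ordering is unambiguous.

```r
# Toy stand-in for the Prosper data (assumption: counts are illustrative only).
loans_demo <- data.frame(
  LoanStatus = rep(
    c("Current", "Completed", "Chargedoff", "Defaulted", "Cancelled"),
    times = c(5, 4, 3, 2, 1)
  ),
  stringsAsFactors = FALSE
)

# Count each status, sort by frequency and keep the four most common names.
status_counts <- sort(table(loans_demo$LoanStatus), decreasing = TRUE)
top_statuses <- names(status_counts)[1:4]

# Restrict the data to those statuses before plotting.
loans_top <- subset(loans_demo, LoanStatus %in% top_statuses)
print(table(loans_top$LoanStatus))
```

On the real data the same pattern would be applied to `loans$LoanStatus`.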
Here we can see that about 50% of the individuals are homeowners.
summary(loans$IsBorrowerHomeowner)
## False True
## 56459 57478
Here I plotted the employment status of the individuals. From the subsequent analysis I excluded those with employment status either empty or not available.
##       (empty)      Employed     Full-time Not available  Not employed
##          2255         67322         26355          5347           835
## Other Part-time Retired Self-employed
## 3806 1088 795 6134
Here we see the number of delinquencies for the individuals (current and total over the last 7 years). The majority of the population (about 67%) had no delinquencies in the last 7 years, while about 79% of the population currently has none.
Since zero is the most common value in both cases, the plots show only those who have or had at least one delinquency. Both plots use the same scale for easier comparison.
summary(loans$CurrentDelinquencies)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.0000 0.5921 0.0000 83.0000 697
zero_current_delinquencies <- subset(loans$CurrentDelinquencies,
loans$CurrentDelinquencies == 0)
length(zero_current_delinquencies)
## [1] 89742
summary(loans$DelinquenciesLast7Years)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 4.155 3.000 99.000 990
zero_total_delinquencies <- subset(loans$DelinquenciesLast7Years,
loans$DelinquenciesLast7Years == 0)
length(zero_total_delinquencies)
## [1] 76439
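The two percentages quoted above follow directly from the counts printed in this section. The short calculation below reproduces them, and also shows, as an alternative I am assuming for comparison, the shares when the missing values reported by `summary()` are left out of the denominator.

```r
# Counts taken from the output above.
n_loans      <- 113937
zero_current <- 89742   # CurrentDelinquencies == 0
zero_last7   <- 76439   # DelinquenciesLast7Years == 0
na_current   <- 697     # NA's in CurrentDelinquencies
na_last7     <- 990     # NA's in DelinquenciesLast7Years

# Share of the whole population (the figures used in the text).
share_current <- round(100 * zero_current / n_loans, 1)  # 78.8
share_last7   <- round(100 * zero_last7 / n_loans, 1)    # 67.1

# Share among individuals with a recorded value only.
share_current_valid <- round(100 * zero_current / (n_loans - na_current), 1)
share_last7_valid   <- round(100 * zero_last7 / (n_loans - na_last7), 1)

print(c(share_current, share_last7, share_current_valid, share_last7_valid))
```

Excluding the missing values shifts both shares by less than one percentage point, so the conclusion does not change.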
Many individuals have a delinquent amount of zero. Here I wanted to see the distribution for those with delinquent amounts higher than 10,000 dollars (2,584 individuals), who might therefore be more prone to default. Below we also see a summary of the results.
summary(loans$AmountDelinquent)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 0.0 0.0 984.5 0.0 463900.0 7622
high_delinquency <- 10000
high_delinquency_individuals <- subset(loans$AmountDelinquent,
loans$AmountDelinquent > high_delinquency)
length(high_delinquency_individuals)
## [1] 2584
summary(high_delinquency_individuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10010 13430 19660 31170 33560 463900
Here I show the available bank card credit. I wanted a closer look at those with the lowest credit, so I plotted only those with less than 50,000 dollars.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 880 4100 11210 13180 646300 7544
Here I wanted to look more closely at those with the highest exposure (a debt to income ratio of 0.5 or higher). We see an unexpected peak at 10.0. That point is either an error or it stands for everyone with a debt to income ratio higher than 10.0.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8554
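One way to check how much of the high-exposure group sits at that peak is to count the values at or above 10 separately. The sketch below uses a small made-up vector (`dti` is hypothetical, with 10.01 mimicking the maximum in the summary above), not the real data.

```r
# Illustrative ratios only; NA mimics the missing values in the real column.
dti <- c(0.14, 0.22, 0.32, 0.55, 0.80, 1.50, 10.01, 10.01, NA)

# Keep the high-exposure group: non-missing ratios of 0.5 or higher.
high_exposure <- dti[!is.na(dti) & dti >= 0.5]

# Count how many of them sit at the suspicious peak.
at_cap <- sum(high_exposure >= 10)
share_at_cap <- at_cap / length(high_exposure)

print(c(n_high = length(high_exposure), n_at_cap = at_cap, share = share_at_cap))
```

A large share at the peak would support the reading that the value is a cap rather than a genuine ratio.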
The income range of the individuals.
## $0 $1-24,999 $25,000-49,999 $50,000-74,999 $75,000-99,999
## 621 7274 32192 31050 16916
## $100,000+ Not employed Not displayed
## 17337 806 7741
Again I wanted to look in more detail at those with lower income, for instance those with a yearly income of less than 50,000 dollars. Interestingly, there is a peak at zero. These people are either not reporting their income correctly or they are funded by someone else.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3200 4667 5608 6825 1750000
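Because the variable is a monthly income while the threshold is a yearly one, the filter needs a conversion. A minimal sketch, assuming the yearly figure is simply twelve times the stated monthly income and using made-up values:

```r
# Illustrative monthly incomes in dollars (not taken from the data set).
monthly_income <- c(0, 1500, 3200, 4667, 6825, 12000)

# Assumption: yearly income is twelve times the stated monthly income.
yearly_income <- monthly_income * 12

# Keep individuals below the 50,000 dollar yearly threshold.
low_income <- monthly_income[yearly_income < 50000]

# Count how many of them report exactly zero income.
n_zero <- sum(low_income == 0)
print(low_income)
```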
Loan current days delinquent (more than 0 days).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 152.8 0.0 2704.0
Loan original amount. The peaks in the histogram arise because people tend to borrow round amounts.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000
Monthly loan payment (greater than 0).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 131.6 217.7 272.5 371.6 2252.0
I explored the differences in the debt to income ratio of those with completed and defaulted loans. Since the original plot is very skewed, I applied a square root transformation to the x axis. From the plot alone we can see that those with defaulted loans tended to have a higher debt to income ratio. However, to be sure, I performed a t-test on the two subsets (excluding those with zero ratio). The p-value of the test was << 0.001 and thus the difference observed is statistically significant.
##
## Welch Two Sample t-test
##
## data: loans.loan_status_com$DebtToIncomeRatio and loans.loan_status_def$DebtToIncomeRatio
## t = -7.1788, df = 5295.1, p-value = 8.002e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.13395587 -0.07648756
## sample estimates:
## mean of x mean of y
## 0.2642619 0.3694836
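The comparison above can be sketched as follows. The data frame `loans_demo` is a hypothetical, randomly generated stand-in, so its numbers will not match the output shown; only the steps (subset by status, drop zero ratios, run `t.test`) mirror the description.

```r
set.seed(1)

# Synthetic stand-in for the completed and defaulted loans (assumption).
loans_demo <- data.frame(
  LoanStatus = rep(c("Completed", "Defaulted"), times = c(200, 100)),
  DebtToIncomeRatio = c(
    rlnorm(200, meanlog = log(0.20), sdlog = 0.5),
    rlnorm(100, meanlog = log(0.30), sdlog = 0.5)
  )
)

# Split by loan status and exclude zero ratios; subset() also drops NA rows.
completed <- subset(loans_demo, LoanStatus == "Completed" & DebtToIncomeRatio > 0)
defaulted <- subset(loans_demo, LoanStatus == "Defaulted" & DebtToIncomeRatio > 0)

# t.test() defaults to var.equal = FALSE, i.e. the Welch two sample test.
result <- t.test(completed$DebtToIncomeRatio, defaulted$DebtToIncomeRatio)
print(result$method)
print(result$p.value)
```

The Welch variant does not assume equal variances in the two groups, which suits groups of such different sizes.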
Previously I investigated the complete range. Now I want to zoom in and investigate, in the same way, two ranges of debt to income ratio: (0, 0.5] and (0.5, 10].
I start with (0, 0.5]. The p-value for this subset is again << 0.001 and thus the difference observed in the plot is statistically significant.
##
## Welch Two Sample t-test
##
## data: loans.loan_status_com$DebtToIncomeRatio and loans.loan_status_def$DebtToIncomeRatio
## t = -10.491, df = 5234.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.02399709 -0.01644091
## sample estimates:
## mean of x mean of y
## 0.2004478 0.2206668
Here I explored the second range, (0.5, 10]. Running the statistical test we see that the p-value is >> 0.05. Thus, among those with a debt to income ratio above 0.5, the mean ratio does not differ significantly between completed and defaulted loans.
##
## Welch Two Sample t-test
##
## data: loans.loan_status_com$DebtToIncomeRatio and loans.loan_status_def$DebtToIncomeRatio
## t = -0.73244, df = 607.92, p-value = 0.4642
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.14209585 0.06489678
## sample estimates:
## mean of x mean of y
## 0.8748787 0.9134782
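The split into the two ranges and the per-range test could be written with `cut()` and `split()`. This is a sketch on synthetic data (`demo` and `dti_range` are hypothetical names), so the p-values it produces say nothing about the Prosper loans.

```r
set.seed(42)

# Synthetic stand-in: same distribution in both groups (assumption).
demo <- data.frame(
  LoanStatus = rep(c("Completed", "Defaulted"), each = 300),
  DebtToIncomeRatio = rlnorm(600, meanlog = log(0.3), sdlog = 1)
)

# Right-closed intervals (0, 0.5] and (0.5, 10]; values outside become NA.
demo$dti_range <- cut(demo$DebtToIncomeRatio, breaks = c(0, 0.5, 10))

# Run one Welch t-test per range; split() drops rows whose range is NA.
p_values <- sapply(split(demo, demo$dti_range), function(d) {
  t.test(DebtToIncomeRatio ~ LoanStatus, data = d)$p.value
})
print(p_values)
```

Note that with these breaks a ratio of 10.01, the maximum in the real data, would fall outside both ranges.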
In this section, I explored the relationship of the “Completed” and “Defaulted” loans, with their respective loantakers’ employment status and other parameters. I analyzed those with loan status completed or defaulted to see whether I could find patterns that would allow me to predict what will happen with those still repaying their loans.
First, I investigated the total delinquencies for the last 7 years. I excluded from the analysis all individuals with zero delinquencies. We can see that in most cases the median values for completed and defaulted loans are identical. Only in the “Not employed” category does the median number of delinquencies for defaulted loans seem to differ a lot from that of completed loans. To evaluate whether this difference is significant I ran a t-test. The resulting p-value is 0.56, so I failed to reject the null hypothesis of equal means.
##
## Welch Two Sample t-test
##
## data: loans.not_employed_com$DelinquenciesLast7Years and loans.not_employed_def$DelinquenciesLast7Years
## t = -0.61616, df = 5.1074, p-value = 0.5642
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -8.501150 5.196984
## sample estimates:
## mean of x mean of y
## 8.947917 10.600000
Next, I looked at the original amount of loaned money. In the same way, the medians appeared to be very similar apart from the “Self-employed” category. For that category I again performed a t-test to see whether the difference was statistically significant. The resulting p-value was << 0.001, thus the difference is statistically significant. So self-employed individuals who defaulted took bigger loans than those who didn’t default.
##
## Welch Two Sample t-test
##
## data: loans.self_employed_com$LoanOriginalAmount and loans.self_employed_def$LoanOriginalAmount
## t = -8.0962, df = 281.22, p-value = 1.729e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4927.606 -3000.126
## sample estimates:
## mean of x mean of y
## 6334.085 10297.951
In this plot I show the stated monthly income of those with less than 50,000 dollars per year. Interestingly, some people with full-time employment reported having zero monthly income. This is almost certainly an error, either in the way the data was stored or in the way it was reported. In this plot all the median values are very similar to each other.
Here I explored the past delinquencies along with whether the individual was a homeowner or not. I wanted to see whether possible mortgages could play any role in the likelihood to default. From the box plots we can see that both median values and variances are very similar for all groups. I excluded from the analysis those with zero delinquencies.
In this plot I investigated the on-time Prosper payments for each employment status. Again everything seems to be very similar apart from the “Not employed”. However, the p-value is about 0.77, so this variable does not show any significant difference either.
##
## Welch Two Sample t-test
##
## data: loans.not_employed_com$OnTimeProsperPayments and loans.not_employed_def$OnTimeProsperPayments
## t = -0.35613, df = 1.2291, p-value = 0.7732
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -54.53195 50.03195
## sample estimates:
## mean of x mean of y
## 17.75 20.00
The majority of loans are 100% funded and thus I was interested to investigate those not fully funded. The “Not employed” category has clear differences between completed and defaulted loans, but the data points are too few to draw any statistical conclusion. For the rest of the categories either there is not a big difference between “Completed” and “Defaulted” or the “Defaulted” category is completely absent.
In terms of the number of investors in each loan, the results are again very similar for all categories apart from the “Self-employed”. The p-value is << 0.001, so defaulted loans of self-employed individuals had, on average, significantly more investors than completed ones. In other words, self-employed individuals with more investors were more likely to default than the rest of the self-employed.
##
## Welch Two Sample t-test
##
## data: loans.self_employed_com$Investors and loans.self_employed_def$Investors
## t = -6.9578, df = 283.39, p-value = 2.398e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -86.21806 -48.19285
## sample estimates:
## mean of x mean of y
## 106.8799 174.0854
Here I explored the relationship of the debt to income ratio of those that currently have an active loan with their employment status and other parameters. I divided all individuals into two groups: those with low (<= 0.5) and those with high (> 0.5) debt to income ratio. I created a bucket variable to assign each individual to one of the two groups and plotted in the same style as in the previous section. In the following plots I didn’t include NA and zero values (when applicable).
In this first plot I investigated whether the current delinquencies can reveal any pattern for those currently holding a loan. The plot shows only up to 15 delinquencies, but all values higher than zero were included in the analysis. Running a t-test for each employment status, we find no differences for the “Employed” and “Full-time” (p > 0.5). The “Other” have a statistically significant difference (p << 0.001) and the “Retired” a significant difference with 0.01 < p < 0.05. Finally, for the “Part-time” there are not enough observations and the “Self-employed” appear only in the low debt to income ratio group.
(Example: t-test for the “Retired”)
##
## Welch Two Sample t-test
##
## data: current_loans.employment_low$CurrentDelinquencies and current_loans.employment_high$CurrentDelinquencies
## t = 2.3041, df = 16, p-value = 0.03496
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.05642235 1.35534236
## sample estimates:
## mean of x mean of y
## 1.705882 1.000000
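The bucketing and the per-status tests described in this section could be organised as below. This is a sketch under stated assumptions: `current_demo`, `bucket`, `min_obs` and `welch_p` are hypothetical names, the values are made up, and the minimum group size of 2 is my own choice rather than a threshold taken from the analysis.

```r
# Toy stand-in for the current loans: 40 "Employed" and 3 "Part-time" rows.
current_demo <- data.frame(
  EmploymentStatus = rep(c("Employed", "Part-time"), times = c(40, 3)),
  DebtToIncomeRatio = c(rep(c(0.2, 0.8), each = 20), 0.3, 0.4, 0.9),
  CurrentDelinquencies = c(rep(1:4, 5), rep(2:5, 5), 1, 2, 1)
)

# Bucket: "low" for ratios in (0, 0.5], "high" for ratios above 0.5.
current_demo$bucket <- cut(current_demo$DebtToIncomeRatio,
                           breaks = c(0, 0.5, Inf),
                           labels = c("low", "high"))

min_obs <- 2  # assumed minimum group size for running a t-test

# Welch t-test of low vs high bucket; NA when a bucket is too small.
welch_p <- function(d) {
  low  <- d$CurrentDelinquencies[which(d$bucket == "low")]
  high <- d$CurrentDelinquencies[which(d$bucket == "high")]
  if (length(low) < min_obs || length(high) < min_obs) return(NA_real_)
  t.test(low, high)$p.value
}

p_by_status <- sapply(split(current_demo, current_demo$EmploymentStatus), welch_p)
print(p_by_status)
```

Returning `NA` for sparse groups keeps categories such as “Part-time” visible in the result instead of raising an error.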
Here I explored those who had delinquent amounts higher than 1,000 dollars (the plot shows only up to 100,000 dollars, but I included all higher amounts in the analysis). Those “Employed” and “Other” were similar for both bucket groups, while those “Full-time” had a statistically significant difference (p << 0.001). The remaining categories do not have enough observations to perform a t-test.
(Example: t-test for the “Full-time”)
##
## Welch Two Sample t-test
##
## data: current_loans.employment_low$AmountDelinquent and current_loans.employment_high$AmountDelinquent
## t = 6.1277, df = 147.38, p-value = 7.746e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 10112.69 19740.29
## sample estimates:
## mean of x mean of y
## 17728.156 2801.667
Following the observation that self-employed individuals with more investors were more likely to default on their loans, I wanted to explore that group further. I looked at those currently having an active loan to see whether a higher number of investors was correlated with a higher debt to income ratio. The p-value of the test was >> 0.05, so I found no evidence of a correlation. However, the retired with low debt to income ratio have a higher number of investors, which is statistically significant (0.01 > p > 0.001).
(Example: t-test for the “Retired”)
##
## Welch Two Sample t-test
##
## data: current_loans.employment_low$Investors and current_loans.employment_high$Investors
## t = 3.3541, df = 39.7, p-value = 0.001761
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 16.48478 66.50182
## sample estimates:
## mean of x mean of y
## 70.92188 29.42857
Even though there was no correlation between the number of investors and the debt to income ratio in the self employed category, I wanted to investigate whether there were correlations with the income range instead. In the next plot I excluded the “Not employed”, the “Part-time” and the “Retired” because I wanted to focus on the rest of the categories which have more data. Below I printed how the income range is correlated with the number of investors per employment status. There are again no statistically significant differences between the groups.
(Example: t-test for the “Full-time”)
##
## Welch Two Sample t-test
##
## data: investors.income_smpl_1$Investors and investors.income_smpl_2$Investors
## t = -1.9747, df = 3.4647, p-value = 0.1304
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -137.12562 27.24842
## sample estimates:
## mean of x mean of y
## 54.2500 109.1886
The last plot of the section shows the original amount of the loan (for those who took more than 5,000 dollars) against the same variables. In this investigation, we see that the “Employed” and the “Other” were the ones that had statistically significant differences between the bucket groups.
(Example: t-test for the “Employed”)
##
## Welch Two Sample t-test
##
## data: current_loans.employment_low$LoanOriginalAmount and current_loans.employment_high$LoanOriginalAmount
## t = 19.191, df = 1201, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1787.741 2194.905
## sample estimates:
## mean of x mean of y
## 13720.21 11728.88
In this section, I present the three figures that I believe are the most informative for the dataset I investigated.
This plot shows the number of investors in a finished loan, grouped by the loantakers’ employment status and coloured by the loan status. Even though most categories were equally likely to default, this plot is interesting because of the “Self-employed” category. Self employed individuals with more investors seemed to be more likely to default than the rest, which is a counterintuitive result.
In this plot we see the number of investors in a current loan, grouped by the loantakers’ employment status (excluding retired, part-time and not employed) and coloured by their income range. This plot shows that the number of investors in a loan does not depend on the loantakers’ income (all statistical tests had p > 0.05). We saw in Fig. 1 that self-employed loantakers with more investors were more likely to default. Interestingly, their income is not correlated with the number of investors involved in their loan.
In this plot we see the current delinquent amounts (> 1,000 dollars) for individuals with loans, grouped by their employment status and coloured by their debt to income ratio. We can observe a statistically significant difference only in the “Full-time” status, where loantakers with a low debt to income ratio have higher delinquent amounts. From the “Part-time” and “Self-employed” categories we see that only those with a low debt to income ratio have delinquent amounts, but they are still only a few. Also, few “Retired” people have delinquent amounts. Finally, there are no significant differences in the “Employed” and “Other” groups.
This dataset pertains to loan data for about 114,000 individuals. It contains 81 variables, of which I explored around 13. At first I explored the variables individually, mainly focusing on subsets of the population that I assumed to be more prone to default (low income, with delinquent payments and high debt to income ratio). Continuing, I assumed that if I explored characteristics for individuals with completed and defaulted loans, I could maybe predict whether those currently holding loans are going to default or not (only those in the category “Current”, as it is the most populous).
I grouped those with completed or defaulted loans according to their employment status because I wanted to see whether this could also be a predicting factor of default. For that part of the analysis I explored variables pertaining to their delinquencies in the last 7 years, original loaned amount, whether they own a home or not, investors involved etc. The first thing I observed is that there doesn’t seem to be any correlation between defaults and employment status. Furthermore, for almost all the variables I observed, completed and defaulted loans showed no statistically significant differences. Afterwards, I investigated those with loan status equal to “Current”. I wanted to see whether the current income, the debt to income ratio, the employment status, the loaned amount and the delinquencies would show any patterns among the population. The results of that analysis again showed no statistically significant differences in most of the values observed. From this I concluded that either there are not enough relevant variables to make inferences, or the events which decide whether a loan defaults are equally likely to happen to everyone.
During the analysis it was very difficult for me to find meaningful variables to compare. As I set my goal to uncover possible driving factors that make loans default, I expected to find correlations between e.g. income or delinquent amounts and the probability for a loan to default. After extensive exploration of many of the variables I found only a few statistically significant differences between groups. On the one hand this was a success, because these differences were very difficult to uncover. On the other hand, this lack of pattern might be due to the fact that the important pieces of information responsible for the defaults are not included in this dataset. Furthermore, this dataset is limited in the sense that it doesn’t include historical data with which I could better investigate those with completed or defaulted loans at the time these loans were active. Finally, the dataset does not contain all the loans that individuals might have, and it would be very interesting to compare these data with data from other sources to explore all possible loans for all individuals.
R markdown documentation: http://rmarkdown.rstudio.com/lesson-1.html
ggplot2 documentation: http://docs.ggplot2.org/current/
grid package documentation: https://stat.ethz.ch/R-manual/R-devel/library/grid/html/00Index.html